Introduction to R
for Social Scientists

Workshop Day 2B | 2022-07-26
Jeffrey M. Girard | Pitt Methods

Visualize I

What is a graphic?

A data visualization expresses data through visual aesthetics.

Describing Graphics

Some simple graphics are easy to describe and may even have ready names.

Describing Graphics

A grammar of graphics will help us describe more complex graphics.

The Grammar of Graphics

  • The grammar of graphics is a set of rules for describing and creating data visualizations
  • To make our data visual (and therefore put our highly evolved occipital lobes to work)…
    • We connect variables to visual qualities
    • We represent observations as visual objects
  • This requires four fundamental elements
    • We will first learn about them in lecture
    • We will then apply them in R using {ggplot2}

Data

Graphics require data (e.g., tibbles), which describe observations using variables.

Aesthetic Mappings

Graphics require aesthetic mappings, which connect data variables to visual qualities.

Scales

Graphics require scales, which connect specific data values to specific aesthetic values.

Geometric Objects

Graphics require geometric objects (geoms), which represent the observations.

ggplot2 Basics

  • The ggplot2 package is a part of tidyverse
    • No need to install or load it separately
    • It plays nicely with tibbles and wrangling
  • It implements the grammar of graphics in R
    • The “gg” stands for “grammar of graphics”
    • Thus, we will need to provide all four elements
  • We will create a pseudo-pipeline of commands
    • However, we will use + rather than |>
    • This is because {ggplot2} predates the R pipe
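
As a rough sketch of how the two operators can coexist: `|>` carries the data through wrangling steps and into ggplot(), after which `+` adds the components of the graphic. The filter on compact cars is only an illustrative assumption and does not come from the slides.

```r
library(ggplot2)
library(dplyr)

# Wrangle with the pipe (|>), then build the graphic with +
p <- mpg |>
  filter(class == "compact") |>   # illustrative subset (an assumption)
  ggplot(aes(x = displ, y = hwy)) +
  geom_point()
p
```

Because `|>` binds more tightly than `+`, the piped data reaches ggplot() before any components are added.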

ggplot2 Live Coding

# SETUP: We will need tidyverse and an example dataset

library(tidyverse)

mpg

# ==============================================================================

# LESSON: First, set the data to a tibble
p <- ggplot(data = mpg)
p

# ==============================================================================

# LESSON: Next, set the aesthetic mappings with aes()

p <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy))
p

# ==============================================================================

# TIP: You can leave off the optional argument names

p <- ggplot(mpg, aes(x = displ, y = hwy))
p

# ==============================================================================

# LESSON: Next, set the positional scales

p <- ggplot(mpg, aes(x = displ, y = hwy)) +
  scale_x_continuous(
    name = "Engine Size (in liters)", 
    limits = c(1, 7), 
    breaks = 1:7
  ) +
  scale_y_continuous(
    name = "Highway Fuel Efficiency (in miles/gallon)",
    limits = c(10, 50),
    breaks = c(10, 20, 30, 40, 50)
  )
p

# ==============================================================================

# LESSON: Finally, add a point geom

p <- 
  ggplot(mpg, aes(x = displ, y = hwy)) + 
  scale_x_continuous(
    name = "Engine Size (in liters)", 
    limits = c(1, 7), 
    breaks = 1:7
  ) +
  scale_y_continuous(
    name = "Highway Fuel Efficiency (in miles/gallon)",
    limits = c(10, 50),
    breaks = c(10, 20, 30, 40, 50)
  ) +
  geom_point()
p

# ==============================================================================

# TIP: If you leave off the scales, ggplot2 will pick default scales for you

p <- ggplot(mpg, aes(x = displ, y = hwy)) + geom_point()
p

# ==============================================================================

# LESSON: We can also customize the geom with arguments

p <- ggplot(mpg, aes(x = displ, y = hwy)) + 
  geom_point(color = "red", shape = "square", size = 2)
p

Basic Layering

  • ggplot2 uses a layered grammar of graphics
    • We can keep stacking geoms on top
  • Layering adds a lot of possibilities
    • We can convey more complex ideas
    • We can learn more about our data
  • But we can still describe these graphics
    • Just describe each layer in turn
    • And describe the layers’ ordering
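
One way to check the layers and their ordering is to inspect the plot object itself. The sketch below assumes that a ggplot object stores its layers in a `layers` list, in the order they were added; the variable names are only illustrative.

```r
library(ggplot2)

p <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  geom_smooth(method = "lm")

# Pull out the geom used by each layer, from bottom layer to top layer
layer_geoms <- unname(
  vapply(p$layers, function(layer) class(layer$geom)[1], character(1))
)
layer_geoms
```

Layers listed first are drawn first, so later layers sit on top of earlier ones.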

Basic Layering Live Coding

# SETUP: We will need tidyverse and an example dataset

library(tidyverse)

mpg

# ==============================================================================

# USECASE: Add a smooth geom (i.e., a fitted trend; method = "lm" gives a line of best fit)

ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  geom_smooth()

ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  geom_smooth(method = "lm")

# ==============================================================================

# USECASE: Add a line geom (i.e., connecting points)

economics

ggplot(economics, aes(x = date, y = unemploy)) + 
  geom_point()

ggplot(economics, aes(x = date, y = unemploy)) + 
  geom_point() +
  geom_line(color = "orange", size = 1)

ggplot(economics, aes(x = date, y = unemploy)) + 
  geom_line(color = "orange", size = 1) +
  geom_point()

# ==============================================================================

# USECASE: Add reference line geoms

ggplot(economics, aes(x = date, y = unemploy)) + 
  geom_hline(yintercept = 0, color = "orange", size = 1) +
  geom_line(color = "blue", size = 1) +
  geom_point()

ggplot(economics, aes(x = date, y = unemploy)) + 
  geom_vline(xintercept = 7.5, color = "orange", size = 1) +
  geom_line(color = "blue", size = 1) +
  geom_point() 

ggplot(economics, aes(x = date, y = unemploy)) + 
  geom_abline(intercept = 4000, slope = 0.5, color = "orange", size = 1) +
  geom_line(color = "blue", size = 1) +
  geom_point() 

Distribution Geoms

  • Variable distributions are critical in data analysis
    • What are the most and least common values?
    • What are the extrema (min and max values)?
    • Are there any outliers or impossible values?
    • How much spread is there in the variable?
    • What shape does the distribution take?
  • Visualizations are a quick way to assess this
    • They can also communicate it to others
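
The questions above can also be answered numerically, which makes a useful cross-check on what a plot suggests. This is a minimal sketch using base R functions on the hwy variable; which summaries to compute is an illustrative choice.

```r
library(ggplot2)  # provides the mpg dataset

hwy <- mpg$hwy

range(hwy)                                # extrema (min and max values)
sd(hwy)                                   # spread
sort(table(hwy), decreasing = TRUE)[1:3]  # three most common values
sum(is.na(hwy))                           # number of missing values
```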

Distribution Live Coding

# SETUP: We will need tidyverse and an example dataset

library(tidyverse)

mpg

# ==============================================================================

# USECASE: Creating histograms

ggplot(mpg, aes(x = hwy)) + 
  geom_histogram()

ggplot(mpg, aes(x = hwy)) + 
  geom_histogram(bins = 20)

ggplot(mpg, aes(x = hwy)) + 
  geom_histogram(binwidth = 2)

ggplot(mpg, aes(x = hwy)) + 
  geom_histogram(binwidth = 2, color = "red", size = 1)

ggplot(mpg, aes(x = hwy)) + 
  geom_histogram(binwidth = 2, color = "red", size = 1, fill = "white")

# ==============================================================================

# USECASE: Creating density plots

ggplot(mpg, aes(x = hwy)) + geom_density()

ggplot(mpg, aes(x = hwy)) + 
  geom_density(color = "red", size = 1, fill = "white")

# ==============================================================================

# USECASE: Creating box plots

ggplot(mpg, aes(x = hwy)) + geom_boxplot()

ggplot(mpg, aes(x = hwy, y = class)) + 
  geom_boxplot(varwidth = TRUE)

# ==============================================================================

# USECASE: Creating bar plots to count categorical variables

ggplot(mpg, aes(x = class)) + geom_bar()

# ==============================================================================

# PITFALL: Don't try to create histograms for categorical variables

ggplot(mpg, aes(x = class)) + geom_histogram() # error: histograms need a continuous x variable

Working with Color

  • Color scales come in two main types:
    • Discrete scales have separate colors
      • Best with factor variables
    • Continuous scales form a gradient
      • Best with numeric variables
  • There are two ways to control color:
    • You can map color to a variable
      • It will take on different values
    • You can set color to a value
      • It will take on one value only

Color Live Coding

# SETUP: We will need tidyverse and an example dataset

library(tidyverse)

mpg

# ==============================================================================

# USECASE: Continuous color scales work well with numeric variables

ggplot(mpg, aes(x = hwy, y = cty, color = displ)) +
  geom_point(size = 4)

ggplot(mpg, aes(x = hwy, y = cty, color = displ)) +
  geom_point(size = 4) +
  scale_color_continuous(type = "viridis")

# ==============================================================================

# USECASE: Use a discrete color scale with categorical variables

ggplot(mpg, aes(x = displ, y = hwy, color = drv)) +
  geom_point()

ggplot(mpg, aes(x = displ, y = hwy, color = drv)) +
  geom_point() +
  scale_color_discrete(
    name = "Drivetrain", 
    breaks = c("4", "f", "r"), 
    labels = c("Four Wheel", "Front Wheel", "Rear Wheel")
  )

# ==============================================================================

# PITFALL: Don't forget to set categorical variables as factors

ggplot(mpg, aes(x = displ, y = hwy, color = cyl)) + 
  geom_point() # R guesses you want a continuous scale

ggplot(mpg, aes(x = displ, y = hwy, color = factor(cyl))) + 
  geom_point() + 
  scale_color_discrete(name = "Cylinders")

# ==============================================================================

# LESSON: Set a geom's color aesthetic to make it always that color

ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(color = "red")

# ==============================================================================

# PITFALL: However, set the color inside the geom function, not inside aes()

ggplot(mpg, aes(x = displ, y = hwy, color = "blue")) + 
  geom_point() #unintended

# ==============================================================================

# LESSON: If you both set and map color, the setting will win

ggplot(mpg, aes(x = displ, y = hwy, color = drv)) + 
  geom_point(color = "blue") 
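
To see why putting color = "blue" inside aes() is unintended, it can help to look at the colors ggplot2 actually computes. The sketch below uses layer_data() to extract the built data for each plot; the object names are only illustrative.

```r
library(ggplot2)

# Mapped: "blue" is treated as a data value and sent through a color scale
p_mapped <- ggplot(mpg, aes(x = displ, y = hwy, color = "blue")) +
  geom_point()

# Set: "blue" is used directly as the color of every point
p_set <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(color = "blue")

mapped_colors <- unique(layer_data(p_mapped)$colour)
set_colors <- unique(layer_data(p_set)$colour)

mapped_colors  # a single default scale color, not "blue"
set_colors     # "blue"
```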

Other Aesthetics

  • For blocky elements like bars…
    • color controls the outline color
    • fill controls the internal color
    • size controls the line thickness
  • Some mappings will induce grouping
    • You’ll get separate geoms per group
  • It can be helpful to use redundant mapping
    • Map one variable to multiple aesthetics
    • Then if one “fails” the other may work

Other Aesthetics Live Coding

# SETUP: We will need tidyverse and an example dataset

library(tidyverse)

mpg

# ==============================================================================

# USECASE: Mapping the shape aesthetic to a categorical variable

ggplot(mpg, aes(x = displ, y = hwy, shape = drv)) +
  geom_point(size = 3)

# ==============================================================================

# PITFALL: Don't try to map shape to a continuous variable

ggplot(mpg, aes(x = displ, y = hwy, shape = hwy)) + 
  geom_point() #error

# NOTE: This doesn't work because there are way more numbers than shapes

# ==============================================================================

# LESSON: Color vs. Fill and Size for Blocks

ggplot(mpg, aes(y = class)) + 
  geom_bar()

ggplot(mpg, aes(y = class)) + 
  geom_bar(color = "darkred", fill = "lightblue", size = 1)

# ==============================================================================

# LESSON: Some aesthetics cause grouping when mapped to a categorical variable

ggplot(mpg, aes(x = displ, y = hwy)) + 
  geom_point() + 
  geom_smooth(method = "lm") # single smooth

ggplot(mpg, aes(x = displ, y = hwy, color = drv)) + 
  geom_point() + 
  geom_smooth(method = "lm") # three smooths

# ==============================================================================

# USECASE: Mapping to the fill aesthetic and setting the alpha property

ggplot(mpg, aes(x = hwy, fill = drv)) + 
  geom_density()

ggplot(mpg, aes(x = hwy, fill = drv)) + 
  geom_density(alpha = 0.3)

# ==============================================================================

# TIP: If you map the same variable to multiple aesthetics, you get redundancy

ggplot(mpg, aes(x = displ, y = hwy, shape = drv, color = drv)) +
  geom_point(size = 3) # if color fails, shape still works

Visualize II

Describe this Graphic 1

Data

  • starwars {tidyverse}

Aesthetics/Scales

  • height to X (continuous)
  • mass to Y (continuous)

Geoms

  • Point (dots)
  • Smooth (local)
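
The description above could be turned back into code roughly as follows. This is one plausible reconstruction, since the slide's original code is not shown; it assumes the default geom_smooth() method, which is a local (loess) smooth for a dataset this small.

```r
library(ggplot2)
library(dplyr)  # provides the starwars dataset

g1 <- ggplot(starwars, aes(x = height, y = mass)) +
  geom_point() +
  geom_smooth()  # default for small datasets is a local (loess) smooth
g1
```

Characters with a missing height or mass are dropped with a warning when the graphic is drawn.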

Describe this Graphic 2

Data

  • mpg {tidyverse}

Aesthetics/Scales

  • displ to X (continuous)
  • hwy to Y (continuous)
  • drv to color (discrete)

Geoms

  • Point (dots)
  • Smooth (linear)

Describe this Graphic 3

Data

  • mpg {tidyverse}

Aesthetics/Scales

  • hwy to X (continuous)
  • class to Y (discrete)

Geoms

  • Boxplot (fill = lightblue)
  • VLine (xintercept = 20)
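
A sketch of code matching this description, assuming the boxplot layer is drawn first and the reference line is added on top (the slide does not state the layer order):

```r
library(ggplot2)

g3 <- ggplot(mpg, aes(x = hwy, y = class)) +
  geom_boxplot(fill = "lightblue") +  # fill is set, not mapped
  geom_vline(xintercept = 20)         # vertical reference line at hwy = 20
g3
```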

Describe this Graphic 4

Data

  • flights {nycflights13}

Aesthetics/Scales

  • origin to X (discrete)
  • origin to color (discrete)
  • count to Y (stat from geom)

Geoms

  • Bar (fill = white)
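
This description might correspond to code like the sketch below. It assumes the {nycflights13} package is installed (the guard skips the plot otherwise) and relies on geom_bar() counting the rows for each origin to produce the Y values.

```r
library(ggplot2)

if (requireNamespace("nycflights13", quietly = TRUE)) {
  g4 <- ggplot(nycflights13::flights, aes(x = origin, color = origin)) +
    geom_bar(fill = "white")  # outline color is mapped, fill is set
  print(g4)
}
```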

Themes

  • Themes control how non-data elements look
    • e.g., how thick to draw the gridlines
    • e.g., where to position the legend
  • Complete themes change many elements at once
    • Some are built into ggplot2
    • Others come in R packages
    • {papaja} provides theme_apa()
  • Individual elements can be customized too

Themes Live Coding

# SETUP: We will need tidyverse and an example graphic

library(tidyverse)

p <- 
  ggplot(mpg, aes(x = displ, y = hwy, color = drv)) + 
  geom_point() +
  labs(title = "Fuel Efficiency")
p

# ==============================================================================

# USECASE: Apply a "complete" theme

p + theme_bw()

p + theme_classic()

# ==============================================================================

# TIP: You can quickly change the font size of all elements with base_size

p + theme_grey(base_size = 24)

# ==============================================================================

# LESSON: The ggthemes package adds some fun complete themes

library(ggthemes)

p + theme_wsj()

p + theme_economist()

p + theme_stata()

# ==============================================================================

# LESSON: For more precise control, we can use theme()

p + theme(legend.position = "top")

p + theme(plot.title = element_text(color = "purple", face = "bold"))

p + theme(panel.grid = element_blank())

# NOTE: There are a lot of elements to learn, so use a cheatsheet!
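
When combining a complete theme with individual tweaks, the order of the `+` steps matters. The sketch below illustrates this with theme_bw() and legend.position; the object names are only illustrative.

```r
library(ggplot2)

p <- ggplot(mpg, aes(x = displ, y = hwy, color = drv)) +
  geom_point() +
  labs(title = "Fuel Efficiency")

# Tweak added after the complete theme: the legend moves to the top
p_tweaked <- p + theme_bw() + theme(legend.position = "top")

# Complete theme added last: it replaces the earlier tweak
p_reset <- p + theme(legend.position = "top") + theme_bw()

p_tweaked
p_reset
```

A safe habit is to add the complete theme first and the theme() customizations afterward.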

Exporting Graphics

  • We may need to export graphics from R
    • e.g., for a paper, poster, or presentation
  • This job is handled fantastically by ggsave()
    • We can create many types of files
    • We can customize the exact size
  • I recommend .png for most daily purposes
    • For publishing, I prefer .pdf or .svg
    • As vector formats, they retain perfect quality at any zoom
    • You can send these files to most publishers

Exporting Live Coding

# SETUP: We will need tidyverse and an example graphic

library(tidyverse)

p <- ggplot(mpg, aes(x = displ, y = hwy)) + 
  geom_point() + geom_smooth() +
  labs(x = "Engine Displacement", y = "Highway MPG")
p

# ==============================================================================

# USECASE: Save a specific ggplot object to a file

ggsave(filename = "pfinal.png", plot = p)

# ==============================================================================

# LESSON: Specify the dimensions of the image to create

ggsave(filename = "pfinal2.png", plot = p, 
       width = 6, height = 3, units = "in")

# ==============================================================================

# LESSON: Just change the extension to create a different file type

ggsave(filename = "pfinal2.pdf", plot = p, 
       width = 6, height = 3, units = "in")

# ==============================================================================

# PITFALL: Creating a very large image may lead to relatively small text

ggsave(filename = "p_poster.png", plot = p, 
       width = 12, height = 8, units = "in")

# ==============================================================================

# TIP: You can quickly increase the text size using base_size

p2 <- p + theme_grey(base_size = 24)

ggsave(filename = "p_poster2.png", plot = p2,
       width = 12, height = 8, units = "in")
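
When experimenting with ggsave(), it can be convenient to write to a temporary folder and confirm that the file was created. This is a small sketch under that assumption; the file name is hypothetical.

```r
library(ggplot2)

p <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point()

# Build a path inside R's per-session temporary directory
out_file <- file.path(tempdir(), "p_example.pdf")

ggsave(filename = out_file, plot = p,
       width = 6, height = 3, units = "in")

file.exists(out_file)  # check that the file was written
```

Files in tempdir() are removed when the R session ends, so use a project folder for anything you want to keep.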

Practice IV